In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.
This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data?select=Car+details+v3.csv.
This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.
Our analysis is driven by three core objectives:
Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.
By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.
The dataset contains the following attributes:
| Attribute Name | Description | |
|---|---|---|
| 1 | name |
The make and model of the vehicle (e.g., Hyundai i10, Honda City). |
| 2 | year |
The year the vehicle was manufactured. |
| 3 | selling_price |
The selling price of the vehicle in Indian Rupees (INR). |
| 4 | km_driven |
The total distance the vehicle has been driven (in km). |
| 5 | fuel |
The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). |
| 6 | seller_type |
The type of seller (“Individual”, “Dealer” or “Trustmark Dealer”). |
| 7 | transmission |
The type of transmission system (“Manual” or “Automatic”). |
| 8 | owner |
The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). |
| 9 | mileage |
The fuel efficiency of the vehicle (in either km/l or km/kg). |
| 10 | engine |
The engine displacement capacity (in CC). |
| 11 | max_power |
The maximum power output of the vehicle’s engine, measured in brake horsepower (bhp). |
| 12 | torque |
A pair of torque and RPM values representing the max torque of the vehicle. The torque values are in either Nm or kgm, and the RPM values are either values or value ranges. |
| 13 | seats |
The seating capacity of the vehicle as an integer. |
The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.
| Attribute Type | Attribute Name |
|---|---|
| Categorical Variables | name, fuel, seller_type,
transmission, owner |
| Discrete Numeric Variables | year, km_driven, seats |
| Continuous Numeric Variables | selling_price, mileage,
engine, max_power, torque |
# Set document-wide options
options(tibble.width = Inf) # Make tibbles display all columns
car_details = read_csv("Car details v3.csv")
nrow(car_details)
## [1] 8128
head(car_details)
## # A tibble: 6 × 13
## name year selling_price km_driven fuel seller_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Maruti Swift Dzire VDI 2014 450000 145500 Diesel Individual
## 2 Skoda Rapid 1.5 TDI Ambition 2014 370000 120000 Diesel Individual
## 3 Honda City 2017-2020 EXi 2006 158000 140000 Petrol Individual
## 4 Hyundai i20 Sportz Diesel 2010 225000 127000 Diesel Individual
## 5 Maruti Swift VXI BSIII 2007 130000 120000 Petrol Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017 440000 45000 Petrol Individual
## transmission owner mileage engine max_power
## <chr> <chr> <chr> <chr> <chr>
## 1 Manual First Owner 23.4 kmpl 1248 CC 74 bhp
## 2 Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual Third Owner 17.7 kmpl 1497 CC 78 bhp
## 4 Manual First Owner 23.0 kmpl 1396 CC 90 bhp
## 5 Manual First Owner 16.1 kmpl 1298 CC 88.2 bhp
## 6 Manual First Owner 20.14 kmpl 1197 CC 81.86 bhp
## torque seats
## <chr> <dbl>
## 1 190Nm@ 2000rpm 5
## 2 250Nm@ 1500-2500rpm 5
## 3 12.7@ 2,700(kgm@ rpm) 5
## 4 22.4 kgm at 1750-2750rpm 5
## 5 11.5@ 4,500(kgm@ rpm) 5
## 6 113.75nm@ 4000rpm 5
car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906
sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717
name variable# Extract the first word from "name" to get the make of the vehicle
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)
mileage variable# View raw data for "mileage" variable
head(car_details$mileage, n=20)
## [1] "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" "16.1 kmpl"
## [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl" "23.59 kmpl" "20.0 kmpl"
## [11] "19.01 kmpl" "17.3 kmpl" "19.3 kmpl" "18.9 kmpl" "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl" "19.7 kmpl" "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage"
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)
# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"
# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
## [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50
Above, the mileage variable is renamed to
fuel_efficiency to prevent confusion with the
km_driven variable, which represents the quantity normally
referred to as vehicle “mileage”.
engine variable# View raw data for "engine" variable
head(car_details$engine, n=20)
## [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC"
## [8] "1364 CC" "1399 CC" "1461 CC" "993 CC" "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC" "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
## [1] 1248 1498 1497 1396 1298 1197 796 1364 1399 1461 993 1248 1061 1198 1248
## [16] 1396 796 1396 1461 1498
max_power variable# Transform "max_power" variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
## [1] "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" "88.2 bhp"
## [6] "81.86 bhp" "37 bhp" "67.1 bhp" "68.1 bhp" "108.45 bhp"
## [11] "60 bhp" "73.9 bhp" "67 bhp" "82 bhp" "88.5 bhp"
## [16] "90 bhp" "46.3 bhp" "88.73 bhp" "64.1 bhp" "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
## [1] 74.00 103.52 78.00 90.00 88.20 81.86 37.00 67.10 68.10 108.45
## [11] 60.00 73.90 67.00 82.00 88.50 90.00 46.30 88.73 64.10 98.60
# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
km_driven to
km_driven_in_10k# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k"
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
## km_driven_in_10k
## <dbl>
## 1 14.6
## 2 12
## 3 14
## 4 12.7
## 5 12
## 6 4.5
selling_price to
selling_price_in_10k# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
## selling_price_in_10k
## <dbl>
## 1 45
## 2 37
## 3 15.8
## 4 22.5
## 5 13
## 6 44
torque column# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]
make_category variable from make
variableunique(car_details$make)
## [1] Maruti Skoda Honda Hyundai Toyota
## [6] Ford Renault Mahindra Tata Chevrolet
## [11] Datsun Jeep Mercedes-Benz Mitsubishi Audi
## [16] Volkswagen BMW Nissan Lexus Jaguar
## [21] Land MG Volvo Daewoo Kia
## [26] Fiat Force Ambassador Ashok Isuzu
## [31] Opel
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
# Create new column "make_category" from "make" column
car_details$make_category = ifelse(car_details$make %in%
c("Ambassador", "Ashok", "Daewoo", "Datsun", "Opel", "Fiat"),
"Budget",
ifelse(car_details$make %in%
c("Chevrolet", "Maruti", "Renault", "Mitsubishi", "Ford",
"Honda", "Hyundai", "Isuzu", "Kia", "Toyota", "Force",
"Volkswagen", "Tata", "Skoda", "Jeep", "MG", "Nissan", "Mahindra"),
"Midrange",
"Luxury"
)
)
# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)
# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget" "Luxury" "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)
## make make_category
## Maruti Midrange
## Skoda Midrange
## Honda Midrange
## Hyundai Midrange
## Toyota Midrange
## Ford Midrange
## Renault Midrange
## Mahindra Midrange
## Tata Midrange
## Chevrolet Midrange
## Datsun Budget
## Jeep Midrange
## Mercedes-Benz Luxury
## Mitsubishi Midrange
## Audi Luxury
## Volkswagen Midrange
## BMW Luxury
## Nissan Midrange
## Lexus Luxury
## Jaguar Luxury
## Land Luxury
## MG Midrange
## Volvo Luxury
## Daewoo Budget
## Kia Midrange
## Fiat Budget
## Force Midrange
## Ambassador Budget
## Ashok Budget
## Isuzu Midrange
## Opel Budget
car_details dataset| Attribute Name | Description | Variable Type |
|---|---|---|
make |
The manufacturer of the vehicle (e.g., Maruti, Hyundai, Honda, etc.). | Factor |
year |
The year the vehicle was manufactured. | Numeric |
selling_price_in_10k |
The selling price of the vehicle in ten-thousands of Indian Rupees (INR). | Numeric |
km_driven_in_10k |
The total distance the vehicle has been driven, measured in ten-thousands of kilometers. | Numeric |
fuel |
The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). | Factor |
seller_type |
The type of seller (“Individual”, “Dealer”, or “Trustmark Dealer”). | Factor |
transmission |
The type of transmission system (“Manual” or “Automatic”). | Factor |
owner |
The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). | Factor |
fuel_efficiency |
The fuel efficiency of the vehicle in km/l. | Numeric |
engine |
The engine displacement capacity of the vehicle, measured in cubic centimeters (CC). | Numeric |
max_power |
The maximum power output of the vehicle’s engine, measured in brake horsepower (BHP). | Numeric |
seats |
The seating capacity of the vehicle. | Numeric |
make_category |
Categorized vehicle make (e.g., “Luxury”, “Midrange”, or “Budget”). | Factor |
head(car_details, n=8)
## # A tibble: 8 × 13
## make year selling_price_in_10k km_driven_in_10k fuel seller_type
## <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Maruti 2014 45 14.6 Diesel Individual
## 2 Skoda 2014 37 12 Diesel Individual
## 3 Honda 2006 15.8 14 Petrol Individual
## 4 Hyundai 2010 22.5 12.7 Diesel Individual
## 5 Maruti 2007 13 12 Petrol Individual
## 6 Hyundai 2017 44 4.5 Petrol Individual
## 7 Maruti 2001 4.5 0.5 Petrol Individual
## 8 Toyota 2011 35 9 Diesel Individual
## transmission owner fuel_efficiency engine max_power seats make_category
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Manual First Owner 23.4 1248 74 5 Midrange
## 2 Manual Second Owner 21.1 1498 104. 5 Midrange
## 3 Manual Third Owner 17.7 1497 78 5 Midrange
## 4 Manual First Owner 23 1396 90 5 Midrange
## 5 Manual First Owner 16.1 1298 88.2 5 Midrange
## 6 Manual First Owner 20.1 1197 81.9 5 Midrange
## 7 Manual Second Owner 16.1 796 37 4 Midrange
## 8 Manual First Owner 23.6 1364 67.1 5 Midrange
car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
selling_price = 1000.# Histogram of selling prices
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
main="Selling Price distribution",
breaks = 30,
col = "lightblue")
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)
par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
xlab = "Number of Cars",
ylab = "Car Make",
col = "lightblue",
cex.names = 0.5,
main = "Num. of Cars in Each Make")
# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")
# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")
# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")
# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
las = 2,cex.axis = 0.7,
main = "Boxplot of Selling Price by Car Make",
xlab = "Car Make",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
col = rainbow(length(unique(car_details$fuel))),
main = "Box Plot of Selling Price by Fuel Type",
xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
col = rainbow(length(unique(car_details$seller_type))),
main = "Box Plot of Selling Price by Seller Type",
xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
col = rainbow(length(unique(car_details$transmission))),
main = "Boxplot of Selling Price by Transmission Type",
xlab = "Transmission Type",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
col = rainbow(length(unique(car_details$owner))),
main = "Boxplot of Selling Price vs Owner Type",
xlab = "Owner Type",
ylab = "Selling Price (10,000 INR)")
# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
main = "Selling Price vs Year and Transmission Type",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
col = c("blue", "red"), pch = 19)
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
col = colours[car_details$make_category],pch=19,
main = "Selling Price vs Year and Make Category",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
col = c("blue", "green","orange"), pch = 19)
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven")
# Subset out extreme values
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)
# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven (extreme values removed)")
# Subset out 15 observations with fuel_efficiency=0
car_details = subset(car_details, car_details$fuel_efficiency!=0,)
# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
xlab="Fuel Efficiency (km/l)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Fuel Efficiency")
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel"="green")
plot(selling_price_in_10k ~ max_power, data=car_details,
col=colours[car_details$fuel],
main="Selling Price vs Max Power (by Fuel Type)",
xlab="Max Power (bhp)",
ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
col = c("blue", "green"), pch = 19)
# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year + km_driven_in_10k + fuel_efficiency + engine + max_power + seats, data = car_details)
# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[, sapply(car_details, is.numeric)])
cor_mat
## year selling_price_in_10k km_driven_in_10k
## year 1.00000000 0.4395767 -0.45496390
## selling_price_in_10k 0.43957666 1.0000000 -0.19812586
## km_driven_in_10k -0.45496390 -0.1981259 1.00000000
## fuel_efficiency 0.37352047 -0.1258892 -0.24186898
## engine -0.02333154 0.4480476 0.30860425
## max_power 0.15993052 0.6857687 0.04886655
## seats 0.02630866 0.1648269 0.24955889
## fuel_efficiency engine max_power seats
## year 0.3735205 -0.02333154 0.15993052 0.02630866
## selling_price_in_10k -0.1258892 0.44804759 0.68576871 0.16482693
## km_driven_in_10k -0.2418690 0.30860425 0.04886655 0.24955889
## fuel_efficiency 1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine -0.5990946 1.00000000 0.68399138 0.66312580
## max_power -0.4007856 0.68399138 1.00000000 0.25911446
## seats -0.4855295 0.66312580 0.25911446 1.00000000
# Extract high correlation values from the upper triangle of the correlation
# matrix (to remove redundant correlations caused by symmetry)
high_cor_indices = which(upper.tri(cor_mat) & cor_mat > 0.5, arr.ind = TRUE)
# Create table for high-correlation variables
high_cor_df = data.frame(
row = rownames(cor_mat)[high_cor_indices[, 1]],
column = colnames(cor_mat)[high_cor_indices[, 2]],
correlation = cor_mat[high_cor_indices]
)
# Display the first few entries in the table
head(high_cor_df)
## row column correlation
## 1 selling_price_in_10k max_power 0.6857687
## 2 engine max_power 0.6839914
## 3 engine seats 0.6631258
max_power and selling_price have a high
correlation of 0.6872307. Hence, selling price increases with max
power.max_power and engine have a high
correlation of 0.6863027. This indicates that higher-displacement
engines tend to be more powerful.seats and engine have high a correlation
of 0.6631. This indicates that vehicles with more seats tend to have
larger engines.In this section, a model for predicting values of
selling_price based on the values of the other variables is
developed and the performance of the model is analyzed.
set.seed(125) # For reproducibility
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323
make)# Definition of "full" linear model
full_model = lm(selling_price_in_10k ~ .-make, data = train_data)
# Model summary
summary(full_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ . - make, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207.237 -11.048 -1.612 8.107 311.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.288e+03 2.832e+02 -29.269 < 2e-16 ***
## year 4.129e+00 1.419e-01 29.098 < 2e-16 ***
## km_driven_in_10k -8.828e-01 9.577e-02 -9.218 < 2e-16 ***
## fuelPetrol -1.064e+01 1.201e+00 -8.858 < 2e-16 ***
## seller_typeIndividual -7.892e+00 1.299e+00 -6.074 1.33e-09 ***
## seller_typeTrustmark Dealer -4.949e+00 5.549e+00 -0.892 0.372486
## transmissionManual -1.221e+01 1.569e+00 -7.782 8.51e-15 ***
## ownerFourth & Above Owner -2.216e+00 2.671e+00 -0.830 0.406730
## ownerSecond Owner -5.364e+00 9.265e-01 -5.790 7.47e-09 ***
## ownerTest Drive Car 2.421e+02 1.222e+01 19.817 < 2e-16 ***
## ownerThird Owner -3.792e+00 1.542e+00 -2.459 0.013979 *
## fuel_efficiency -6.184e-01 1.814e-01 -3.408 0.000658 ***
## engine 9.462e-03 1.801e-03 5.254 1.55e-07 ***
## max_power 4.977e-01 2.041e-02 24.379 < 2e-16 ***
## seats -8.198e-02 5.853e-01 -0.140 0.888617
## make_categoryLuxury 1.195e+02 4.550e+00 26.274 < 2e-16 ***
## make_categoryMidrange 1.053e+01 3.093e+00 3.403 0.000672 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.97 on 5272 degrees of freedom
## Multiple R-squared: 0.7107, Adjusted R-squared: 0.7098
## F-statistic: 809.5 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(full_model))
## [1] 17
#Leave one out cross validation RMSE
sqrt(mean((resid(full_model) / (1 - hatvalues(full_model))) ^ 2))
## [1] 27.37401
# Diagnostic plots
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))
# Diagnostic tests
bp_test <- bptest(full_model)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: full_model
## BP = 1499.3, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(full_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.7217, p-value < 2.2e-16
# Multicollinearity test
vif(full_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.190700 1 1.480101
## km_driven_in_10k 1.634226 1 1.278369
## fuel 2.588377 1 1.608843
## seller_type 1.112922 2 1.027108
## transmission 1.446943 1 1.202889
## owner 1.412292 4 1.044096
## fuel_efficiency 3.683989 1 1.919372
## engine 5.759182 1 2.399830
## max_power 3.026818 1 1.739775
## seats 2.419890 1 1.555599
## make_category 1.562362 2 1.118009
The “full” linear model has a high \(R^2\) value of 0.7212, indicating that
72.12% of the variance in selling_price_in_10k is explained
by the predictors.
Most predictors are statistically significant (p-value < 0.05), suggesting they contribute meaningfully to the model. Exceptions are:
seller_typeTrustmark Dealer (p = 0.33)ownerFourth & Above Owner (p = 0.14)seats (p = 0.48)From this initial model, it is apparent:
year is positively correlated with price (newer cars
sell for higher prices)engine, max_power,
make_category:Luxury are positively correlated with selling
prices (more powerful vehicles and luxury vehicles sell for more)km_driven_in_10k, fuelPetrol, and
fuel_efficiency are negatively correlated with selling
price: higher mileage vehicles and petrol vehicles are associated with
lower selling pricesDiagnostic testing for this model:
The Breusch-Pagan test yielded a very high test statistic of 1538.2, and a p-value less than \(2.2*10^{-16}\), which strongly indicates heteroscedasticity. This is reflected by the “fat tails” in the QQ plot.
The Shapiro-Wilk test yielded a low p-value less than \(2.2*10^{-16}\), which strongly rejects the null hypothesis that the residuals are normally distributed.
For this model, VIF analysis shows:
engine (VIF = 2.36) and fuel_efficiency
(VIF = 1.80) have relatively higher multicollinearity (but they still
within acceptable ranges).train_data$selling_price_in_10k = log(train_data$selling_price_in_10k)
test_data$selling_price_in_10k = log(test_data$selling_price_in_10k)
make) | with BIC# Definition of "full" linear model with BIC-based stepwise model selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model = step(lm(selling_price_in_10k ~ .-make, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_full_add_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ (make + year + km_driven_in_10k +
## fuel + seller_type + transmission + owner + fuel_efficiency +
## engine + max_power + seats + make_category) - make, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68520 -0.16717 0.02508 0.19042 1.42921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.217e+02 3.054e+00 -72.603 < 2e-16 ***
## year 1.112e-01 1.530e-03 72.642 < 2e-16 ***
## km_driven_in_10k -3.850e-03 1.033e-03 -3.728 0.000195 ***
## fuelPetrol -1.369e-01 1.296e-02 -10.569 < 2e-16 ***
## seller_typeIndividual -7.010e-02 1.401e-02 -5.003 5.82e-07 ***
## seller_typeTrustmark Dealer 8.294e-02 5.984e-02 1.386 0.165779
## transmissionManual -9.709e-02 1.692e-02 -5.737 1.02e-08 ***
## ownerFourth & Above Owner -1.212e-01 2.880e-02 -4.209 2.61e-05 ***
## ownerSecond Owner -7.404e-02 9.991e-03 -7.411 1.45e-13 ***
## ownerTest Drive Car 6.114e-01 1.317e-01 4.641 3.55e-06 ***
## ownerThird Owner -1.087e-01 1.663e-02 -6.536 6.90e-11 ***
## fuel_efficiency 1.214e-02 1.956e-03 6.207 5.83e-10 ***
## engine 2.566e-04 1.942e-05 13.213 < 2e-16 ***
## max_power 8.500e-03 2.201e-04 38.616 < 2e-16 ***
## seats 4.647e-02 6.311e-03 7.362 2.09e-13 ***
## make_categoryLuxury 5.779e-01 4.906e-02 11.778 < 2e-16 ***
## make_categoryMidrange 2.194e-01 3.336e-02 6.576 5.28e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2909 on 5272 degrees of freedom
## Multiple R-squared: 0.8474, Adjusted R-squared: 0.8469
## F-statistic: 1830 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_full_add_model))
## [1] 17
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_full_add_model) / (1 - hatvalues(bic_full_add_model))) ^ 2))
## [1] 0.2914511
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model,which=c(1,2))
# Diagnostic tests
bp_test = bptest(bic_full_add_model)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: bic_full_add_model
## BP = 372.41, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98094, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.190700 1 1.480101
## km_driven_in_10k 1.634226 1 1.278369
## fuel 2.588377 1 1.608843
## seller_type 1.112922 2 1.027108
## transmission 1.446943 1 1.202889
## owner 1.412292 4 1.044096
## fuel_efficiency 3.683989 1 1.919372
## engine 5.759182 1 2.399830
## max_power 3.026818 1 1.739775
## seats 2.419890 1 1.555599
## make_category 1.562362 2 1.118009
| Adjusted R^2 | Number of parameters |
Breusch-Pagan test | Shapiro-Wilk test (with 5k sample residuals) |
VIF>5 | Non significant predictors(pvalue > 0.05) |
LOOCV RMSE |
|
|---|---|---|---|---|---|---|---|
| Full additive model with BIC (Excluding “make” predictor) |
0.8469423 | 17 | Constant variance assumption violated with BP = 372.41 |
Normality assumption violated with W = 0.98173 |
None | seller_typeTrustmark Dealer | 0.2914511 |
The “full” linear model has a high \(R^2\) value of 0.8469423, indicating that
84.69% of the variance in selling_price_in_10k is explained
by the predictors.
Most predictors are statistically significant (p-value <
0.05), suggesting they contribute meaningfully to the model. Except
seller_typeTrustmark Dealer (p = 0.165779)
Diagnostic testing for this model:
The Breusch-Pagan test yielded a very high test statistic of 372.41 and a p-value less than \(2.2*10^{-16}\) which rejects the null hypothesis that constant variance exists in residuals
The Shapiro-Wilk test with 5000 sample residuals yielded a low p-value less than \(2.2*10^{-16}\), which rejects the null hypothesis that the residuals are normally distributed.
bic_small_add_model = step(lm(selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner +engine, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_small_add_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power +
## transmission + fuel + seller_type + owner + engine, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.66616 -0.16658 0.02659 0.19106 1.39914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.362e+02 2.518e+00 -93.801 < 2e-16 ***
## year 1.186e-01 1.248e-03 95.034 < 2e-16 ***
## make_categoryLuxury 5.800e-01 4.917e-02 11.796 < 2e-16 ***
## make_categoryMidrange 2.257e-01 3.358e-02 6.723 1.97e-11 ***
## max_power 7.966e-03 2.098e-04 37.967 < 2e-16 ***
## transmissionManual -8.779e-02 1.696e-02 -5.176 2.36e-07 ***
## fuelPetrol -1.740e-01 9.803e-03 -17.752 < 2e-16 ***
## seller_typeIndividual -7.327e-02 1.407e-02 -5.209 1.97e-07 ***
## seller_typeTrustmark Dealer 7.134e-02 6.028e-02 1.183 0.237
## ownerFourth & Above Owner -1.188e-01 2.899e-02 -4.098 4.23e-05 ***
## ownerSecond Owner -7.980e-02 1.002e-02 -7.962 2.05e-15 ***
## ownerTest Drive Car 5.996e-01 1.327e-01 4.518 6.39e-06 ***
## ownerThird Owner -1.156e-01 1.671e-02 -6.920 5.07e-12 ***
## engine 2.549e-04 1.290e-05 19.762 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2931 on 5275 degrees of freedom
## Multiple R-squared: 0.845, Adjusted R-squared: 0.8446
## F-statistic: 2211 on 13 and 5275 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_small_add_model))
## [1] 14
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_small_add_model) / (1 - hatvalues(bic_small_add_model))) ^ 2))
## [1] 0.2935451
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_small_add_model,which=c(1,2))
# Diagnostic tests
bptest(bic_small_add_model)
##
## studentized Breusch-Pagan test
##
## data: bic_small_add_model
## BP = 356.78, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_small_add_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98009, p-value < 2.2e-16
# Multicollinearity test
vif(bic_small_add_model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.434890 1 1.197869
## make_category 1.537469 2 1.113529
## max_power 2.707912 1 1.645573
## transmission 1.431656 1 1.196518
## fuel 1.459458 1 1.208080
## seller_type 1.103696 2 1.024973
## owner 1.393515 4 1.042351
## engine 2.501522 1 1.581620
anova(bic_small_add_model, bic_full_add_model)
## Analysis of Variance Table
##
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission +
## fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ (make + year + km_driven_in_10k + fuel +
## seller_type + transmission + owner + fuel_efficiency + engine +
## max_power + seats + make_category) - make
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5275 453.14
## 2 5272 446.00 3 7.1387 28.128 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
indicies_to_exclude = unname(which(cooks.distance(bic_small_add_model) > 4/length(cooks.distance(bic_small_add_model))))
train_data_filtered = train_data[-indicies_to_exclude,]
#Number of excluded observations
nrow(train_data) - nrow(train_data_filtered)
## [1] 283
n= nrow(train_data_filtered)
refitted_bic_small_add_model = lm(formula = formula(bic_small_add_model), data = train_data_filtered)
# Model summary
summary(refitted_bic_small_add_model)
##
## Call:
## lm(formula = formula(bic_small_add_model), data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91665 -0.16073 0.02035 0.17474 0.91654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.398e+02 2.354e+00 -101.878 < 2e-16 ***
## year 1.204e-01 1.167e-03 103.166 < 2e-16 ***
## make_categoryLuxury 5.844e-01 5.026e-02 11.627 < 2e-16 ***
## make_categoryMidrange 2.545e-01 3.380e-02 7.529 6.03e-14 ***
## max_power 8.331e-03 1.909e-04 43.652 < 2e-16 ***
## transmissionManual -7.914e-02 1.522e-02 -5.202 2.05e-07 ***
## fuelPetrol -1.661e-01 8.680e-03 -19.136 < 2e-16 ***
## seller_typeIndividual -8.145e-02 1.250e-02 -6.514 8.02e-11 ***
## seller_typeTrustmark Dealer 3.674e-02 6.880e-02 0.534 0.5934
## ownerFourth & Above Owner -7.936e-02 3.096e-02 -2.563 0.0104 *
## ownerSecond Owner -6.684e-02 8.897e-03 -7.512 6.85e-14 ***
## ownerThird Owner -9.338e-02 1.554e-02 -6.010 1.99e-09 ***
## engine 2.653e-04 1.172e-05 22.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.253 on 4993 degrees of freedom
## Multiple R-squared: 0.8685, Adjusted R-squared: 0.8682
## F-statistic: 2747 on 12 and 4993 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(refitted_bic_small_add_model))
## [1] 13
#Leave one out cross validation RMSE
sqrt(mean((resid(refitted_bic_small_add_model) / (1 - hatvalues(refitted_bic_small_add_model))) ^ 2))
## [1] 0.2531199
# Diagnostic plots
par(mfrow = c(1, 2))
plot(refitted_bic_small_add_model,which=c(1,2))
# Diagnostic tests
bptest(refitted_bic_small_add_model)
##
## studentized Breusch-Pagan test
##
## data: refitted_bic_small_add_model
## BP = 251.66, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(refitted_bic_small_add_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.99136, p-value < 2.2e-16
# Multicollinearity test
vif(refitted_bic_small_add_model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.427903 1 1.194949
## make_category 1.372681 2 1.082412
## max_power 2.453995 1 1.566523
## transmission 1.319415 1 1.148658
## fuel 1.455345 1 1.206377
## seller_type 1.088027 2 1.021315
## owner 1.356970 3 1.052192
## engine 2.415179 1 1.554085
bic_int_model = step(lm(selling_price_in_10k ~ year + make_category + max_power*fuel + transmission*seller_type + owner + engine, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_int_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power +
## fuel + transmission + seller_type + owner + engine + max_power:fuel,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64759 -0.16287 0.02749 0.18978 1.40197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.354e+02 2.522e+00 -93.335 < 2e-16 ***
## year 1.182e-01 1.250e-03 94.613 < 2e-16 ***
## make_categoryLuxury 6.041e-01 4.952e-02 12.198 < 2e-16 ***
## make_categoryMidrange 2.229e-01 3.354e-02 6.646 3.32e-11 ***
## max_power 7.604e-03 2.305e-04 32.985 < 2e-16 ***
## fuelPetrol -2.697e-01 2.730e-02 -9.882 < 2e-16 ***
## transmissionManual -8.613e-02 1.695e-02 -5.082 3.86e-07 ***
## seller_typeIndividual -7.119e-02 1.406e-02 -5.063 4.27e-07 ***
## seller_typeTrustmark Dealer 6.115e-02 6.026e-02 1.015 0.310308
## ownerFourth & Above Owner -1.169e-01 2.896e-02 -4.037 5.49e-05 ***
## ownerSecond Owner -8.069e-02 1.001e-02 -8.059 9.42e-16 ***
## ownerTest Drive Car 5.309e-01 1.338e-01 3.968 7.36e-05 ***
## ownerThird Owner -1.177e-01 1.670e-02 -7.048 2.05e-12 ***
## engine 2.503e-04 1.294e-05 19.345 < 2e-16 ***
## max_power:fuelPetrol 1.125e-03 2.996e-04 3.756 0.000175 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2927 on 5274 degrees of freedom
## Multiple R-squared: 0.8454, Adjusted R-squared: 0.845
## F-statistic: 2060 on 14 and 5274 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_int_model))
## [1] 15
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_int_model) / (1 - hatvalues(bic_int_model))) ^ 2))
## [1] 0.2932303
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model,which=c(1,2))
# Diagnostic tests
bptest(bic_int_model)
##
## studentized Breusch-Pagan test
##
## data: bic_int_model
## BP = 378.51, df = 14, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.97766, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## year 1.442807 1 1.201169
## make_category 1.602363 2 1.125098
## max_power 3.277862 1 1.810487
## fuel 11.341871 1 3.367769
## transmission 1.432631 1 1.196926
## seller_type 1.108617 2 1.026113
## owner 1.422868 4 1.045070
## engine 2.523906 1 1.588680
## max_power:fuel 9.698828 1 3.114294
anova(bic_small_add_model, bic_int_model)
## Analysis of Variance Table
##
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission +
## fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ year + make_category + max_power + fuel +
## transmission + seller_type + owner + engine + max_power:fuel
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5275 453.14
## 2 5274 451.93 1 1.2087 14.106 0.0001747 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted = fitted(refitted_bic_small_add_model)
weights = 1 / (fitted^2)
weighted_model = lm(formula=formula(refitted_bic_small_add_model),data=train_data_filtered,weights=weights)
# Model summary
summary(weighted_model)
##
## Call:
## lm(formula = formula(refitted_bic_small_add_model), data = train_data_filtered,
## weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.31567 -0.04284 0.00591 0.04728 0.33094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.312e+02 1.993e+00 -115.991 < 2e-16 ***
## year 1.161e-01 9.890e-04 117.377 < 2e-16 ***
## make_categoryLuxury 5.689e-01 6.183e-02 9.201 < 2e-16 ***
## make_categoryMidrange 2.486e-01 3.143e-02 7.910 3.13e-15 ***
## max_power 8.558e-03 2.143e-04 39.939 < 2e-16 ***
## transmissionManual -7.947e-02 1.820e-02 -4.367 1.29e-05 ***
## fuelPetrol -1.530e-01 9.200e-03 -16.631 < 2e-16 ***
## seller_typeIndividual -8.964e-02 1.429e-02 -6.272 3.87e-10 ***
## seller_typeTrustmark Dealer 2.574e-02 8.374e-02 0.307 0.759
## ownerFourth & Above Owner -1.082e-01 2.442e-02 -4.430 9.61e-06 ***
## ownerSecond Owner -7.745e-02 8.799e-03 -8.803 < 2e-16 ***
## ownerThird Owner -9.627e-02 1.388e-02 -6.936 4.54e-12 ***
## engine 2.617e-04 1.297e-05 20.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07421 on 4993 degrees of freedom
## Multiple R-squared: 0.8962, Adjusted R-squared: 0.8959
## F-statistic: 3592 on 12 and 4993 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(weighted_model))
## [1] 13
#Leave one out cross validation RMSE
sqrt(mean((resid(weighted_model) / (1 - hatvalues(weighted_model))) ^ 2))
## [1] 0.2536116
# Diagnostic plots
#par(mfrow = c(1, 2))
plot(weighted_model,which=c(1,2))
# Diagnostic tests
bptest(weighted_model)
##
## studentized Breusch-Pagan test
##
## data: weighted_model
## BP = 135.18, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(weighted_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.99066, p-value < 2.2e-16
# Multicollinearity test
vif(weighted_model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.549228 1 1.244680
## make_category 1.226144 2 1.052290
## max_power 2.461797 1 1.569012
## transmission 1.225061 1 1.106825
## fuel 1.577181 1 1.255859
## seller_type 1.072007 2 1.017535
## owner 1.370854 3 1.053979
## engine 2.552408 1 1.597626